iSentenizer-μ: Multilingual Sentence Boundary Detection Model

Authors

Derek F Wong

Lidia S Chao

Xiaodong Zeng

Abstract

Sentence boundary detection (SBD) systems are normally quite sensitive to the genre of the data they are trained on. Here, genre refers to shifts in text topic and to new language domains. Although detection models can be retrained for different languages or new text genres, the previous model has to be thrown away and the creation process restarted from scratch. In this paper, we present a multilingual sentence boundary detection system (iSentenizer-μ) for Danish, German, English, Spanish, Dutch, French, Italian, Portuguese, Greek, Finnish, and Swedish. The proposed system is able to detect sentence boundaries in a mixture of different text genres and languages with high accuracy. We employ the i+Learning algorithm, an incremental tree learning architecture, to construct the system. Under this incremental learning framework, iSentenizer-μ adapts to text on different topics and in Roman-alphabet languages by merging new data into the existing model, learning the new knowledge incrementally by revision instead of retraining. The system has been extensively evaluated on different languages and text genres and compared against two state-of-the-art SBD systems, Punkt and MaxEnt. The experimental results show that the proposed system outperforms the other systems on all datasets.
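The idea of merging new data into an existing model instead of retraining can be illustrated with a minimal sketch. This is not the i+Learning algorithm or the iSentenizer-μ implementation: it swaps in a simple count-based voting model as a stand-in, and the names `features`, `IncrementalBoundaryModel`, `update`, and `split`, as well as the three context features, are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

CANDIDATES = ".!?"  # characters treated as possible sentence ends


def features(text, i):
    """Describe the context of the candidate mark at index i (hypothetical features)."""
    start = i
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    prev_token = text[start:i].lower()      # token right before the mark
    next_char = text[i + 1:].lstrip()[:1]   # "" at the end of the text
    return (
        "prev=" + prev_token,
        "next_upper=" + str(next_char.isupper() or next_char == ""),
        "space_after=" + str(i + 1 >= len(text) or text[i + 1].isspace()),
    )


class IncrementalBoundaryModel:
    """Count-based stand-in for an incremental learner (not i+Learning)."""

    def __init__(self):
        # feature -> Counter({True: boundary count, False: non-boundary count})
        self.counts = defaultdict(Counter)

    def update(self, sentences):
        """Merge new gold sentences into the existing counts; nothing is discarded."""
        text = " ".join(sentences)
        gold, offset = set(), 0
        for s in sentences:
            gold.add(offset + len(s) - 1)   # index of the last character of s
            offset += len(s) + 1            # +1 for the joining space
        for i, ch in enumerate(text):
            if ch in CANDIDATES:
                for f in features(text, i):
                    self.counts[f][i in gold] += 1

    def score(self, text, i):
        """Sum of per-feature votes in [-1, 1]; unseen features contribute 0."""
        total = 0.0
        for f in features(text, i):
            c = self.counts.get(f)
            if c:
                total += (c[True] - c[False]) / (c[True] + c[False])
        return total

    def split(self, text):
        sentences, start = [], 0
        for i, ch in enumerate(text):
            if ch in CANDIDATES and self.score(text, i) > 0:
                sentences.append(text[start:i + 1].strip())
                start = i + 1
        tail = text[start:].strip()
        if tail:
            sentences.append(tail)
        return sentences


model = IncrementalBoundaryModel()
model.update(["Dr. Smith arrived.", "He left."])        # initial English data
model.update(["Er kam am 3. Mai an.", "Sie ging."])     # German data merged in later
print(model.split("Dr. Jones spoke. She left."))
print(model.split("Wir kamen am 3. Juni. Gut."))
```

In this sketch the second `update` call only adds counts to the model built by the first, so what was learned from the English data (for example, that "Dr." rarely ends a sentence) is kept while the German ordinal "3." is learned as a non-boundary.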

Similar resources

Multilingual Relevant Sentence Detection Using Reference Corpus

IR with reference corpus is one approach when dealing with relevant sentences detection, which takes the result of IR as the representation of query (sentence). Lack of information and language difference are two major issues in relevant detection among multilingual sentences. This paper refers to a parallel corpus for information expansion and translation, and introduces different representati...

Unsupervised Multilingual Sentence Boundary Detection

In this article, we present a language-independent, unsupervised approach to sentence boundary detection. It is based on the assumption that a large number of ambiguities in the determination of sentence boundaries can be eliminated once abbreviations have been identified. Instead of relying on orthographic clues, the proposed system is able to detect abbreviations with high accuracy using thre...

Almost-Unsupervised Cross-Language Opinion Analysis at NTCIR-7

We describe the Sussex NLCL System entered in the NTCIR-7 Multilingual Opinion Analysis Task (MOAT). Our main focus is on the problem of portability of natural language processing systems across languages. Our system was the only one entered for all four of the MOAT languages, Japanese, English, and Simplified and Traditional Chinese. The system uses an almost-unsupervised approach applied to tw...

Experiments in Multilingual Sentence Boundary Recognition

David D. Palmer, CS Division, University of California, Berkeley. An important step in many multilingual text processing tasks, including sentence alignment, automatic lexicon construction, and machine translation, is the segmentation of texts into individual sentences. In this paper we present the results of experiments...

Multilingual Summarization: Dimensionality Reduction and a Step Towards Optimal Term Coverage

In this paper we present three term weighting approaches for multilingual document summarization and give results on the DUC 2002 data as well as on the 2013 Multilingual Wikipedia feature articles data set. We introduce a new interval-bounded nonnegative matrix factorization. We use this new method, latent semantic analysis (LSA), and latent Dirichlet allocation (LDA) to give three term-weight...

Volume: 2014

Publication date: 2014
